Abstract

This paper focuses on the relation between computational learning
theory and resource-bounded dimension. We intend to establish close
connections between the learnability/nonlearnability of a concept class
and its corresponding size in terms of effective dimension, which will
allow the use of powerful dimension techniques in computational
learning and, vice versa, the import of learning results into complexity
via dimension. Firstly, we obtain a tight result on the dimension of
on-line mistake-bound learnable classes. Secondly, in relation to PAC
learning, we show that the polynomial-space dimension of PAC learnable
classes of concepts is zero. This provides a hypothesis on effective
dimension that implies the inherent unpredictability of concept classes
(the classes that verify this property are classes not efficiently PAC
learnable using any hypothesis). Thirdly, in relation to the space
dimension of classes that are learnable by membership query algorithms,
the main result proves that the polynomial-space dimension of concept
classes learnable by a membership-query algorithm is zero.



1 Introduction 

Computational learning theory studies the performance obtained and the 
resources needed in machine learning. This formalization dates back to
the work of Valiant, who in 1984 introduced the Probably Approximately
Correct (PAC) learning model [27]. This model has been explored and
several alternatives have been considered, such as the query learning model
by Angluin [3] or the on-line mistake-bound learning model by Littlestone
[19]. The main open problems in computational learning theory concern
the limits of each learning model. We want to prove that a certain class
of concepts is not learnable under a certain model, therefore establishing
a lower bound on the inherent learning complexity of that class. On the
other hand the quest for new and efficient learning algorithms is a very 
active and challenging area. This paper explores the relationship between 
computational learning theory and effective dimension with the ultimate 
goal of obtaining nonlearnability results from dimension results as well as 
translating learning algorithms into effective dimension proofs. 

Resource-bounded dimension (or effective dimension) was developed by
Lutz [22] as a way to overcome limitations of resource-bounded measure
[21, 20]. Both of them are quantitative tools in Computational Complexity
that were introduced in order to analyze the size of complexity classes. Their 
power is witnessed by an interesting list of results both in Computational 
Complexity and Information Theory (see |114 ITU] for an updated bibliogra- 
phy). The main antecedent of this paper is the work of Watanabe et al that 
investigated the resource-bounded measure of classes that are learnable by
PAC or equivalence query algorithms [18]. More specifically, they proved
that i) P/poly subclasses that can be learned with PAC algorithms have
polynomial measure 0 if EXP $\neq$ AM; and ii) the P/poly subclasses that
can be learned with membership queries have polynomial measure 0. From
these results, hypotheses were provided on the resource-bounded measure of 
circuits that imply the non-learnability of Boolean Circuits in polynomial 
time. 

On the other hand, in the context of effective dimension, Hitchcock ex- 
plored the relationship of dimension with logarithmic loss unpredictability 
in [12]. 

This paper provides new results in line with those cited above. Firstly,
in relation to on-line mistake-bound learning, we obtain an upper bound
on the polynomial-time dimension of concept classes that are learnable by
on-line algorithms in exponential time and with $\alpha 2^n$ mistakes. Moreover,
we prove that this upper bound is optimal. Based on our results, Hitchcock
[14] has further investigated the case of subexponentially many mistakes
and null dimension, with interesting applications [7]. Secondly, in relation
to PAC learning, we show that the polynomial-space dimension of
PAC-learnable classes of concepts is zero. This provides a hypothesis on
effective dimension that implies the inherent unpredictability of concept
classes, an interesting property in computational learning that corresponds
to classes not efficiently PAC learnable using any hypothesis. Furthermore,
there exist connections between hardness results for PAC learning and
constructions in the field of public-key cryptography [16] that can now be
rephrased with effective dimension hypotheses. Finally, we study the space
dimension of classes that are learnable by membership query algorithms. The
main result proves that the polynomial-space dimension of concept classes
learnable by a membership-query algorithm is zero. This can be used to
demonstrate that, for classes that are complex in the dimension sense, a
complex representation is necessary in order to efficiently learn the class
with membership queries.

2 Preliminaries 

A string is a finite binary sequence $w \in \{0,1\}^*$. The Cantor space $\mathbf{C}$
is the set of all infinite binary sequences. Let $|w|$ denote the length of the
string $w$ and $\lambda$ denote the empty string. Let $x[i \ldots j]$ for $0 \le i \le j$ denote
the $i$-th through the $j$-th bits of $x$, where $x \in \{0,1\}^* \cup \mathbf{C}$. Let $wx$ denote
the concatenation of the string $w$ and the string or sequence $x$. Let $w \sqsubseteq x$
denote that $w$ is a prefix of $x$. Let $s_0, s_1, s_2, \ldots$ be the enumeration of $\{0,1\}^*$
in lexicographical order and $s^n_0, s^n_1, \ldots, s^n_{2^n-1}$ be the enumeration of $\{0,1\}^n$.
Each language $L \subseteq \{0,1\}^*$ can be identified with its characteristic sequence
$\chi_L \in \mathbf{C}$ where

$$\chi_L[i] = \begin{cases} 1 & \text{if } s_i \in L, \\ 0 & \text{otherwise.} \end{cases}$$

Abusing notation, $L$ can be seen as a language or as a characteristic
sequence. Also, $L^{=n}$ can be seen as $L \cap \{0,1\}^n$ or as the sequence
$L[2^n - 1 \ldots 2^{n+1} - 2]$. To resolve the resulting ambiguity, we use $\#A$ for the
cardinality of a set $A$.
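
The identification of a language with its characteristic sequence can be made concrete with a small sketch. The helper names and the toy language below are illustrative assumptions, not part of the paper; the code only mirrors the enumeration $s_0, s_1, \ldots$ and the slice $L[2^n - 1 \ldots 2^{n+1} - 2]$ described above.

```python
from itertools import count, islice, product


def standard_enumeration():
    """Yield s_0, s_1, s_2, ...: all binary strings by length, then lexicographically."""
    for length in count(0):
        for bits in product("01", repeat=length):
            yield "".join(bits)


def characteristic_prefix(language, num_bits):
    """Return the first num_bits bits of chi_L, where bit i is 1 iff s_i is in L."""
    return "".join(
        "1" if s in language else "0"
        for s in islice(standard_enumeration(), num_bits)
    )


def level_slice(chi_prefix, n):
    """Return L^{=n} as the substring chi_L[2^n - 1 .. 2^(n+1) - 2]."""
    return chi_prefix[2**n - 1 : 2 ** (n + 1) - 1]


# Hypothetical toy language: the strings of length at most 2 that end in 1.
L = {"1", "01", "11"}
chi = characteristic_prefix(L, 7)  # covers the strings of length 0, 1 and 2
```

Here `level_slice(chi, n)` always has length $2^n$, which is the string representation of a concept on $\{0,1\}^n$ used later to define the dimension of a concept class.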

Let $\Delta$ denote any of the following classes of total functions,

$\mathrm{pspace} = \{ f : \{0,1\}^* \to \{0,1\}^* \mid f \text{ is computable in polynomial space} \}$
$\mathrm{p} = \{ f : \{0,1\}^* \to \{0,1\}^* \mid f \text{ is computable in polynomial time} \}$
$\mathrm{p}_2 = \{ f : \{0,1\}^* \to \{0,1\}^* \mid f \text{ is computable in } n^{(\log n)^{O(1)}} \text{ time} \}$
$\mathrm{plogon} = \{ f : \{0,1\}^* \to \{0,1\}^* \mid f$ is computable by an on-line machine with
working and output space polylogarithmic in the size of the input$\}$

Let $D$ be a discrete domain such as $\mathbb{N}$ or $\{0,1\}^*$. A function $f : D \to [0, \infty)$
is $\Delta$-computable if there exists a function $\hat{f} : D \times \mathbb{N} \to \mathbb{Q} \cap [0, \infty)$ in $\Delta$ such
that for all $(w, n) \in D \times \mathbb{N}$, $|\hat{f}(w, n) - f(w)| \le 2^{-n}$ (with $n$ coded in unary
and the output coded in binary).

2.1 Effective dimension



Effective dimension was introduced by Lutz as a generalization of the
classical Hausdorff dimension and is defined in terms of s-gales [22, 13].
The initial purpose of effective dimension was to serve as a quantitative
tool in Computational Complexity, distinguishing complexity classes by size,
but lately several interesting applications have been explored, including
connections to Information Theory and compression algorithms [15, 24].

Definition. Let $s \in [0, \infty)$. An s-gale is a function $d : \{0,1\}^* \to [0, \infty)$
such that for all $w \in \{0,1\}^*$,

$$d(w) = 2^{-s} [d(w0) + d(w1)].$$

An s-gale can be interpreted as a strategy for betting on the successive
bits of a binary string. The fairness of the gambling game depends on $s$.
The notion of success corresponds to getting unbounded capital in this game.
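
As an illustration of the betting interpretation, the following sketch builds one particular s-gale and checks the defining identity numerically. The strategy (always staking a fixed fraction `q` of the capital on the next bit being 1) and the function names are assumptions chosen for the example; the paper does not prescribe them.

```python
import math


def bernoulli_gale(w, s, q):
    """Hypothetical s-gale that always stakes a fraction q of its capital on bit 1."""
    capital = 1.0
    for bit in w:
        # Each observed bit multiplies the capital by 2^s times the stake placed on it.
        capital *= (2.0 ** s) * (q if bit == "1" else 1.0 - q)
    return capital


def satisfies_gale_condition(d, w, s):
    """Check d(w) = 2^(-s) * (d(w0) + d(w1)) up to floating-point error."""
    return math.isclose(d(w), 2.0 ** (-s) * (d(w + "0") + d(w + "1")))
```

For instance, with `s = 0.5` and `q = 0.1` the capital is multiplied by $2^{0.5} \cdot 0.9 > 1$ on every 0 bit, so this gale gains capital on sequences that contain few 1s even though $s < 1$.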

Definition. Let $d : \{0,1\}^* \to [0, \infty)$ be an s-gale.

1. $d$ succeeds on a language $L \subseteq \{0,1\}^*$ if

$$\limsup_{n \to \infty} d(L[0 \ldots n-1]) = \infty.$$

2. The success set of $d$ is

$$S^{\infty}[d] = \{ L \subseteq \{0,1\}^* \mid d \text{ succeeds on } L \}.$$

The effective dimension of a set is defined as the infimum $s$ for which there
exists an s-gale that succeeds on it. Depending on the computational resource
that is allowed in the computation of the s-gale, different types of effective
dimension can be defined.

Definition. Let $X \subseteq \mathbf{C}$ and $\Delta \in \{\mathrm{p}, \mathrm{p}_2, \mathrm{pspace}, \mathrm{plogon}\}$. The
$\Delta$-dimension of $X$ is

$$\dim_{\Delta}(X) = \inf \{ s \mid \text{there exists a } \Delta\text{-computable s-gale } d \text{ s.t. } X \subseteq S^{\infty}[d] \}.$$

The choice of resource bounds p, p$_2$, pspace and plogon is not arbitrary, since
they are suitable for the quantitative study of important time and space bounded
complexity classes (exponential time, exponential space, and polynomial
space, respectively) and also have natural connections to resource-bounded
notions from information theory [15, 24].

2.2 Learning Models

Following [5], information is codified by the value of $n$ boolean attributes.
A string in $\{0,1\}^n$ is called an instance and the set $\{0,1\}^n$ is called the
instance space. A concept $c$ is defined as a subset of $\{0,1\}^n$ or, equivalently,
as a boolean function $c : \{0,1\}^n \to \{0,1\}$, where $c(x) = 1$ if $x \in c$ and
$c(x) = 0$ if $x \notin c$.

Let $\mathcal{C}_n$ be a set of concepts in the instance space $\{0,1\}^n$. A
representation for $\mathcal{C}_n$ consists of a set of strings $L_n$ and a mapping $\sigma_n$ from $L_n$
to $\mathcal{C}_n$ that associates each string in $L_n$ with a concept in $\mathcal{C}_n$. A concept
complexity measure for $\mathcal{C}_n$ is a mapping $\mathrm{size}_n$ from $\mathcal{C}_n$ to $\mathbb{N}$ (usually the
minimum length of a string representing the concept in the representation
$(L_n, \sigma_n)$).

For each $n \in \mathbb{N}$, let $\mathcal{C}_n$ be a set of concepts on $\{0,1\}^n$, $L_n$ and $\sigma_n$ be
a representation for $\mathcal{C}_n$, and $\mathrm{size}_n$ be a concept complexity measure for
$\mathcal{C}_n$. Then $\mathcal{C} = \{\mathcal{C}_n\}_{n \in \mathbb{N}}$ denotes a concept class and $\{(\mathcal{C}_n, L_n, \sigma_n, \mathrm{size}_n)\}_{n \in \mathbb{N}}$
is called the representation class of $\mathcal{C}$. Usually the representation and the
concept complexity measure will be understood from context.

Example 2.1 Consider the concept class $k$-CNF from Valiant's original
paper [27], for some fixed $k \in \mathbb{N}$. Here the representation language $L_n$
consists of all CNF expressions on $n$ variables (say $x_1 \ldots x_n$) that have at
most $k$ literals per clause, $\mathcal{C}_n$ consists of all $c \subseteq \{0,1\}^n$ such that $c$ is the
set of satisfying assignments of one of these expressions, $\sigma_n$ maps a CNF
expression to its set of satisfying assignments, and $\mathrm{size}_n(c)$ is the number
of literals in the smallest $k$-CNF representation of $c$.
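
The pieces of Example 2.1 can be sketched in a few lines. The encoding of a CNF as a list of clauses, each clause a tuple of `(index, polarity)` literals with 0-based indices, is an assumption made for this illustration, as is the sample expression `phi`.

```python
from itertools import product


def evaluate_cnf(cnf, x):
    """Evaluate a CNF on assignment x; (i, True) means x_i and (i, False) means not x_i."""
    return all(any(x[i] == int(positive) for i, positive in clause) for clause in cnf)


def sigma(cnf, n):
    """Map a CNF expression to the concept it represents: its set of satisfying assignments."""
    return {x for x in product((0, 1), repeat=n) if evaluate_cnf(cnf, x)}


def literal_count(cnf):
    """Number of literals in this expression (an upper bound on size_n of its concept)."""
    return sum(len(clause) for clause in cnf)


# Hypothetical 2-CNF on n = 3 variables: (x0 or not x1) and (x2).
phi = [((0, True), (1, False)), ((2, True),)]
concept = sigma(phi, 3)
```

Note that `literal_count` measures one particular expression, whereas $\mathrm{size}_n(c)$ is the minimum over all $k$-CNF expressions that `sigma` maps to the same concept.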

Once the notion of a concept class and its representation have been 
formalized, the definition of the learning models used in this paper is given 
next. 

2.2.1 The Probably Approximately Correct Learning Model (PAC 
Learning) 

The probably approximately correct (PAC) learning model formalizes the
process of learning from examples. For instance, Support Vector Machines,
Neural Networks and Decision Trees are based on this theoretical model.
In the PAC learning model, the learning algorithm has access to a source of
positive and negative examples of an unknown target concept $c$ from a fixed
and known concept class $\mathcal{C}$. The learning algorithm must approximate the
target concept from the examples it has seen. Valiant formalizes this notion
in [27].

More formally, let $\mathcal{C}$ and $\mathcal{H}$ be two concept classes. A Probably
Approximately Correct algorithm for learning $\mathcal{C}$ by $\mathcal{H}$ is an algorithm that, when
given examples of some concept $c \in \mathcal{C}$, will produce as output (a
representation of) some concept $h \in \mathcal{H}$ that is an approximation of $c$ in a sense made
precise below. The class $\mathcal{C}$ is called the target class and $\mathcal{H}$ the hypothesis
class. Correspondingly, $c$ is called the target concept and $h$ the hypothesis of the
algorithm.

Definition. Let $\{0,1\}^n$ be the instance space, let $D$ be a distribution on
it and let $c$ be the target concept. The error of $h$ with respect to the concept
$c$ and the distribution $D$ is defined as

$$\mathrm{error}_D(h, c) = \Pr_{x \in D}[h(x) \neq c(x)].$$

That is, $\mathrm{error}_D(h, c)$ is the probability that $h$ and $c$ do not match on an
instance chosen randomly according to $D$. Intuitively, $h$ will be a good
approximation of the target $c$ if $\mathrm{error}_D(h, c)$ is small.
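
For small $n$ the error can be computed exactly by summing the probability mass of the disagreement region. The sketch below assumes the distribution is given as a function from instances to probabilities; the target, hypothesis and uniform distribution are hypothetical choices for the example.

```python
from itertools import product


def error(h, c, dist, n):
    """error_D(h, c): total probability mass of the instances where h and c disagree."""
    return sum(dist(x) for x in product((0, 1), repeat=n) if h(x) != c(x))


n = 3
uniform = lambda x: 1.0 / 2**n   # assumed distribution D on {0,1}^n
target = lambda x: x[0] & x[1]   # hypothetical target concept c
hypothesis = lambda x: x[0]      # hypothetical hypothesis h
```

In this example `h` and `c` disagree exactly on the instances with $x_0 = 1$ and $x_1 = 0$, which is a quarter of the instance space under the uniform distribution.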

Definition. Let $\mathcal{C} = \{\mathcal{C}_n\}_{n \in \mathbb{N}}$ be a concept class and let $\mathcal{H} = \{\mathcal{H}_n\}_{n \in \mathbb{N}}$
be the hypothesis class. $\mathcal{C}$ is PAC learnable in terms of $\mathcal{H}$ if there exists a
learning algorithm $A$ such that,

• for all $n \in \mathbb{N}$,

• for every target concept $c \in \mathcal{C}_n$,

• for every probability distribution $D$ on the instance space $\{0,1\}^n$,

• for all $\epsilon$ and $\delta$, where $0 < \epsilon, \delta < 1$,

if algorithm $A$ on input $(n, \epsilon, \delta)$ is given independent random examples of
$\{0,1\}^n$ drawn according to $D$, together with the information of whether each
example is in $c$, then with probability at least $1 - \delta$, $A$ returns a hypothesis
$h \in \mathcal{H}_n$ with $\mathrm{error}_D(h, c) < \epsilon$.

Moreover, the running time of $A$ is bounded by a polynomial in $n$, $1/\epsilon$, $1/\delta$
and $\mathrm{size}_n(c)$.

$\mathcal{C}$ is properly PAC learnable if $\mathcal{C}$ is learnable in terms of $\mathcal{C}$, and $\mathcal{C}$ is PAC
learnable if $\mathcal{C}$ is PAC learnable in terms of some class $\mathcal{H}$.

The idea behind this definition is that the learning algorithm must
process the examples in polynomial time, i.e. be computationally efficient,
and must be able to produce a good approximation to the target concept
with high probability using only a reasonable number of examples.

Efficiency of the learning algorithm is measured with respect to relevant
parameters: size of examples ($n$), size of target concept ($\mathrm{size}_n$), $1/\epsilon$, and
1/5 (see PIUE] for more details). 

Notice that the definition above involves representations. It is clear that
for the same concept there can be very different representations. In
particular, there might be representations with very different sizes, and using
a less succinct representation will take more time, if only to output it.
Since the running time of the PAC learning algorithm is polynomial
in the representation size, concepts with large representation sizes such as
$O(2^n)$ are trivially learnable in most cases.

Example 2.2 [9] A number of fairly sharp results have been found for the
notion of proper PAC learnability. The following summarizes some of these
results. The negative results are based on the complexity-theoretic
assumption that RP $\neq$ NP [17].

1. Conjunctive concepts are properly PAC learnable [27], but the class
of concepts in the form disjunction of two conjunctions is not properly 
PAC learnable [TTJ, and neither is the class of existential conjunctive 
concepts on structural instance spaces with two objects. 

2. Linear threshold concepts (perceptrons) are properly PAC learnable 
on both Boolean and real-valued instance spaces [2], but the class of
concepts in the form of the conjunction of two linear threshold concepts 
is not PAC learnable p]. 

3. Linear thresholds of linear thresholds (i.e. multilayer perceptrons with 
hidden units) are properly PAC learnable, but the class of concepts in 
the form of the disjunction of two multilayer perceptrons is not PAC 
learnable. In addition, if the weights are restricted to 1 and 0 (but
the threshold is arbitrary), then linear threshold concepts on Boolean
instance spaces are not PAC learnable [17].

4. The classes of $k$-DNF, $k$-CNF and $k$-decision lists are properly PAC
learnable for each fixed $k$ [26, 25].

Most of the difficulties in proper PAC learning are due to the
computational difficulty of finding a hypothesis in the particular form specified by
the target class. For example, while Boolean threshold functions with 0-1
weights are not PAC learnable on Boolean instance spaces (unless RP=NP),
they are PAC learnable by general Boolean threshold functions. Similar
extended hypothesis spaces can be found for the two classes mentioned in item 1
above that are not properly PAC learnable. Hence, it turns out that these
classes are PAC learnable [T71 |B] . 

2.2.2 The Query Learning Model 

In the PAC learning model, the learning algorithm can be seen as passive,
in the sense that it does not decide which examples it will see during the
training phase (the labeled examples are given randomly according to some
fixed distribution). However, it might be interesting to allow the learning
algorithm to select some particular example and ask for its labeling. This is
the idea of a membership query, introduced in the seminal paper of Valiant
[27] and formally defined as follows:

• Membership query, Mem($x$): the input is an assignment $x \in \{0,1\}^n$
and the output is the value of the target concept $c$ evaluated at $x$.

However, there are other types of queries that might be useful for the
learning algorithm, such as those introduced in Angluin's model (query learning
model) [3]. In the original definition, the learning algorithm has access to a
fixed set of oracles (experts) that will answer specific kinds of queries about
the target concept $c$. The types of queries that are considered are the following:
membership, equivalence, subset, superset, disjointness and exhaustiveness.
This paper will focus on learning algorithms that have access to membership
queries or equivalence queries.
An equivalence query is formally defined as:

• Equivalence query, Equ($h$): the input is a representation of a concept
$h \in \mathcal{H}_n$ and the output is either YES, if $h$ is equivalent to the target
concept $c$, or NO, indicating that they are not equivalent. In the latter
case, a counterexample $x$ satisfying $c(x) \neq h(x)$ is also returned.

Thus, learning with membership or equivalence queries is formally defined
as follows.

Definition. Let $\mathcal{C} = \{\mathcal{C}_n\}_{n \in \mathbb{N}}$ be a concept class and let $\mathcal{H} = \{\mathcal{H}_n\}_{n \in \mathbb{N}}$
be the hypothesis class. $\mathcal{C}$ is learnable in terms of $\mathcal{H}$ from membership
(equivalence) queries if there exists an algorithm $A$ such that, for every $n$
and every concept $c \in \mathcal{C}_n$, $A$ asks membership (equivalence) queries about $c$
and eventually halts and outputs a hypothesis $h \in \mathcal{H}_n$ that is equivalent to
the target, i.e. for all $x$, $c(x) = h(x)$.

$\mathcal{C}$ is efficiently learnable from membership or equivalence queries if the
running time and the total number of queries made by $A$ are bounded by a
polynomial in $n$ and $\mathrm{size}_n(c)$.

Notice again that the choice of representation is very relevant for query
learnability, in particular the representation size. A concept with a
representation size $\Theta(2^n)$ is trivially learnable in most settings.
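
A standard textbook illustration of exact learning from membership queries, not taken from this paper, is the class of monotone conjunctions: querying the all-ones assignment with a single variable switched off reveals whether that variable occurs in the target. The sketch below assumes the target is a monotone conjunction given by a list of variable indices; the oracle wrapper that counts queries is a hypothetical helper.

```python
def learn_monotone_conjunction(n, mem):
    """Return the variable indices of the target, using exactly n membership queries."""
    relevant = []
    for i in range(n):
        x = [1] * n
        x[i] = 0                 # all-ones assignment with x_i switched off
        if mem(tuple(x)) == 0:   # the target rejects it, so x_i occurs in the conjunction
            relevant.append(i)
    return relevant


def make_oracle(target_vars):
    """Build Mem(x) for a hypothetical target conjunction and count the queries asked."""
    calls = {"count": 0}

    def mem(x):
        calls["count"] += 1
        return int(all(x[i] == 1 for i in target_vars))

    return mem, calls
```

The learner asks $n$ queries and runs in time polynomial in $n$, so under these assumptions it meets the efficiency requirement of the definition above with the conjunction itself as the hypothesis.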

2.2.3 The On-line Mistake-bound Learning Model 

This learning model is the on-line mistake-bound model of Littlestone [19],
which considers learning from examples in a situation in which the goal of the
learner is simply to make few mistakes. In this model, there is no separate 
set of training examples. The learner attempts to predict the appropriate 
response for each example (i.e. predict if it is a negative or positive exam- 
ple), starting with the first example received. After making this prediction, 
the learner is told whether the prediction was correct, and then uses this 
information to improve its hypothesis. The learner continues to learn as 
long as it receives examples; that is, it continues to examine the informa- 
tion it receives in an effort to improve its hypothesis. The evaluation of the 
algorithm's learning behavior is made by counting the worst-case number of 
mistakes that it will make while learning a concept from a specified concept 
class. This corresponds to learning in an adversary setting for the order of 
the examples received. 

Definition. Let $\mathcal{C} = \{\mathcal{C}_n\}_{n \in \mathbb{N}}$ be a concept class.

1. An on-line learning algorithm $A$ is an algorithm which gets inputs of
the form $(n, y, h)$ and outputs 0 or 1, where $|y| = n$ is the example to
be predicted and $h = ((x_1, b_1) \ldots (x_r, b_r))$ is the history of previously
received examples together with the corresponding correct answers.

2. For an integer $n$ and a concept $c \in \mathcal{C}_n$, the number of mistakes made
by $A$ on $c$ is defined as follows,

$$\mathrm{Mist}(n, c, A) = \max_{h} \#\{ x \mid |x| = n,\ A(n, x, h) \neq c(x) \}.$$

3. The worst-case number of mistakes made by $A$ on $\mathcal{C}_n$ is defined as

$$\mathrm{Mist}(n, \mathcal{C}_n, A) = \max_{c \in \mathcal{C}_n} \mathrm{Mist}(n, c, A).$$

4. The concept class $\mathcal{C}$ is on-line learnable with $f(n)$ mistakes if there
exists an on-line algorithm $A$, which runs in time polynomial in its
input length, so that for infinitely many $n$, $\mathrm{Mist}(n, \mathcal{C}_n, A) \le f(n)$.

Proposition 2.3 [19] If $\mathcal{C}$ is learnable with $f(n)$ equivalence queries, then
$\mathcal{C}$ is on-line learnable with $f(n)$ mistakes.

Notice that the time bounds of the learning algorithms are relevant in
Proposition 2.3.
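
The on-line protocol can be simulated directly for a small, explicitly listed concept class. The sketch below uses the classical halving strategy (predict the majority vote of the concepts still consistent with the history) purely as an example of a learner; it is not the algorithm studied in this paper and it is not polynomial time in general. It also evaluates a single order of examples, whereas Mist takes the maximum over all histories.

```python
def halving_predict(version_space, y):
    """Predict the majority label for y among the concepts consistent with the history."""
    ones = sum(1 for c in version_space if y in c)
    return 1 if 2 * ones > len(version_space) else 0


def count_mistakes(concept_class, target, examples):
    """Run the learner on one fixed order of examples and count its wrong predictions."""
    version_space = list(concept_class)
    mistakes = 0
    for y in examples:
        truth = 1 if y in target else 0
        if halving_predict(version_space, y) != truth:
            mistakes += 1
        # The correct answer is revealed; keep only the concepts that agree with it.
        version_space = [c for c in version_space if (y in c) == bool(truth)]
    return mistakes
```

Concepts are assumed to be Python sets of instances. Every mistake removes at least half of the remaining concepts, so on a class with $N$ concepts this learner makes at most $\log_2 N$ mistakes on any order of examples.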

2.3 Dimension of concept classes 

In this paper, a concept class $\mathcal{C}$ will have two representations:

• The representation $(L_n, \sigma_n)$ used by the learning algorithm.

• The representation given by the characteristic sequence for each length,
that is, for $L \in \mathcal{C}$, $n \in \mathbb{N}$, $L^{=n}$ is a string of length $2^n$. This
representation will be used to define the dimension of a concept class.

Notice that with this convention, the dimension of a concept class does
not depend on the representation used for each individual concept. That is,
we will assume the existence of some representation $\sigma_n$ with $\mathrm{size}_n$ bounded
in terms of $n$ and the existence of some learning algorithm, but the results
will not be focused on the particular representation.

Also, we will assume that there exists a hypothesis class $\mathcal{H}$ from which
$\mathcal{C}$ is learnable, but the results in this paper will not depend on the particular
$\mathcal{H}$. Thus, we will abbreviate learnable in terms of $\mathcal{H}$ by learnable.

3 Dimension and on-line mistake-bound learning 

This section focuses on the on-line mistake-bound learning model. We prove 
that the mistake bound of on-line learning provides an upper bound on the 
p-dimensions of each class. Moreover, this upper bound can be tight for 
certain classes. 

The relationship of dimension with logarithmic loss unpredictability was 
explored in [12] and is intuitively close to on-line learning when restricting 
to examples given in lexicographical order. 

Based on the results in this section (although with chronologically earlier 
publications) Hitchcock p3] has further explored the case of dimension zero 
and small mistake bounds for on-line learning, including reductions to these 
classes.

The main theorem in this section provides an upper bound on the
p-dimension of concept classes that are on-line learnable with $\alpha 2^n$ mistakes.

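Theorem 3.1 Let $\alpha \le 1/2$ be a p-computable number. Let $\mathcal{C}$ be a concept
class that is learnable with $\alpha 2^n$ mistakes, then

$$\dim_{\mathrm{p}}(\mathcal{C}) \le \mathcal{H}(\alpha),$$

where $\mathcal{H}$ is Shannon binary entropy, $\mathcal{H}(\alpha) = \alpha \log \frac{1}{\alpha} + (1 - \alpha) \log \frac{1}{1 - \alpha}$.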

Proof. Let $\alpha < 1/2$ (the case $\alpha = 1/2$ is trivial). We will show that for any
$s > \mathcal{H}(\alpha)$, there exists an s-gale that succeeds on $\mathcal{C}$. Let $\epsilon = \frac{s - \mathcal{H}(\alpha)}{2}$.

We will use the following function $h_\alpha(x) = \alpha \log \frac{1}{x} + (1 - \alpha) \log \frac{1}{1 - x}$. This
is a continuous function in the range $(0, 1)$ which takes the minimum value
$\mathcal{H}(\alpha)$ at $x = \alpha$. Let $\delta$ be such that $h_\alpha(\alpha + \delta) < \mathcal{H}(\alpha) + \epsilon$, and $\alpha + \delta < 1/2$.

Let $A$ be an algorithm that learns $\mathcal{C}$ with $\alpha 2^n$ mistakes. For each
$z \in \{0,1\}^*$ of length between 0 and $2^n$ we denote by $h(z)$ the history
corresponding to having received examples $s^n_0 \ldots s^n_{|z|-1}$ with corresponding correct
answers $z[0] \ldots z[|z| - 1]$, that is, examples in lexicographical order and
answers recorded in the bits of $z$. We define an s-gale $d : \{0,1\}^* \to [0, \infty)$
recursively as follows.

$$d(\lambda) = 1.$$

Let $n \in \mathbb{N}$. For any $w$ with $2^n - 1 \le |w| < 2^{n+1} - 1$, we define

$$d(wb) = \begin{cases} (\alpha + \delta)\, 2^s d(w) & \text{if } A(n, h(w[2^n - 1 \ldots |w| - 1]), s_{|w|}) \neq b, \\ (1 - (\alpha + \delta))\, 2^s d(w) & \text{if } A(n, h(w[2^n - 1 \ldots |w| - 1]), s_{|w|}) = b. \end{cases}$$

Notice that $d$ can be computed in polynomial time since $A$ works in time
polynomial in the input length (where the input of $A$ includes the history).
Let $L \in \mathcal{C}$ be a concept. Since $A$ makes at most $\alpha 2^n$ mistakes on $L^{=n}$ and
$\alpha + \delta < 1 - (\alpha + \delta)$,

$$\begin{aligned}
d(L[0 \ldots 2^{n+1} - 2]) &= d(L^{=0} \ldots L^{=n}) \\
&\ge d(L^{=0} \ldots L^{=n-1})\, (\alpha + \delta)^{\alpha 2^n} (1 - (\alpha + \delta))^{(1 - \alpha) 2^n}\, 2^{2^n s} \\
&= 2^{-h_\alpha(\alpha + \delta) 2^n}\, 2^{2^n s}\, d(L^{=0} \ldots L^{=n-1}) \\
&= 2^{2^n (s - h_\alpha(\alpha + \delta))}\, d(L^{=0} \ldots L^{=n-1}) \\
&> 2^{2^n (s - \mathcal{H}(\alpha) - \epsilon)}\, d(L^{=0} \ldots L^{=n-1}) \\
&\ge 2^{\sum_{i=0}^{n} 2^i (s - \mathcal{H}(\alpha) - \epsilon)} = 2^{(2^{n+1} - 1)\epsilon},
\end{aligned}$$

that tends to infinity with $n$. Therefore $\mathcal{C} \subseteq S^{\infty}[d]$ and $\dim_{\mathrm{p}}(\mathcal{C}) \le s$. □
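
The key step of the proof of Theorem 3.1, namely that the gale gains at least a factor $2^{2^n (s - h_\alpha(\alpha + \delta))}$ on level $n$ whenever the learner makes at most $\alpha 2^n$ mistakes there, can be checked numerically. The sketch below works with base-2 logarithms of the capital to avoid overflow; the function names and the sample parameters are illustrative assumptions.

```python
import math


def h_alpha(alpha, x):
    """h_alpha(x) = alpha*log2(1/x) + (1 - alpha)*log2(1/(1 - x))."""
    return alpha * math.log2(1 / x) + (1 - alpha) * math.log2(1 / (1 - x))


def log2_level_gain(num_mistakes, n, s, alpha, delta):
    """log2 of the factor by which d grows on level n if the learner errs num_mistakes times."""
    p = alpha + delta
    correct = 2**n - num_mistakes
    # A wrong prediction multiplies the capital by p * 2^s, a right one by (1 - p) * 2^s.
    return num_mistakes * math.log2(p) + correct * math.log2(1 - p) + s * 2**n


def log2_guaranteed_gain(n, s, alpha, delta):
    """log2 of the lower bound 2^(2^n * (s - h_alpha(alpha + delta))) used in the proof."""
    return 2**n * (s - h_alpha(alpha, alpha + delta))
```

With exactly $\alpha 2^n$ mistakes the two quantities coincide, and fewer mistakes can only increase the gain because $\alpha + \delta < 1 - (\alpha + \delta)$.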

As a corollary, classes of concepts on-line learnable with $o(2^n)$ mistakes
have p-dimension 0. This corollary was later generalized by Hitchcock [14].

Corollary 3.2 Let $\mathcal{C}$ be a class of concepts on-line learnable with at most
$o(2^n)$ mistakes. Then,

$$\dim_{\mathrm{p}}(\mathcal{C}) = 0.$$

The following corollary is an immediate consequence of Proposition 2.3
and improves a result by Lindner, Schuler and Watanabe [18].

Corollary 3.3 If Boolean circuits are learnable in polynomial time (even
linear exponential time) with $o(2^n)$ equivalence queries then the concept
class of Boolean circuits has p-dimension 0.

Next we prove that Theorem 3.1 is optimal by presenting a concept class
that is on-line learnable with $\alpha 2^n$ mistakes and has p-dimension $\mathcal{H}(\alpha)$.

Theorem 3.4 Let $\alpha \le 1/2$ be a p-computable number. There exists a
concept class $\mathcal{C}_\alpha$ that is on-line learnable with $\alpha 2^n$ mistakes such that
$\dim_{\mathrm{p}}(\mathcal{C}_\alpha) = \mathcal{H}(\alpha)$.

Proof. For each $\alpha$ consider the concept class

$$\mathcal{C}_\alpha = \{ L \in \mathbf{C} \mid \forall n,\ \#L^{=n} \le \alpha 2^n \}$$

and, in line with the proof of Lemma 5.1 of [22], we can show that
$\dim_{\mathrm{p}}(\mathcal{C}_\alpha) = \mathcal{H}(\alpha)$. Consider the algorithm $A$ which predicts 0 all the time.
The number of mistakes made by this algorithm on any concept in $\mathcal{C}_\alpha$ is at
most $\alpha 2^n$. □

Our last result shows that the values of the dimension of on-line learnable
classes with $\alpha 2^n$ mistakes are dense in the interval $[0, \mathcal{H}(\alpha)]$.

Theorem 3.5 Let $\alpha \le 1/2$ be a p-computable number and let $\beta \in [0, \mathcal{H}(\alpha))$
be p-computable. Then, there exists a concept class $\mathcal{C}_\beta$ that is on-line
learnable with $\alpha 2^n$ mistakes such that

$$\dim_{\mathrm{p}}(\mathcal{C}_\beta) = \beta.$$

Proof. Let $\beta > 0$ and let $\mathcal{C}_\beta = \{ L \in \mathbf{C} \mid \forall n,\ \#L^{=n} \le \gamma 2^n \}$, where $\gamma$ is the
smallest value such that $\mathcal{H}(\gamma) = \beta$. Notice that $\mathcal{H}(x)$ is symmetric, and is a
strictly increasing continuous function for $x \le 1/2$, so $\gamma \le \alpha$. By Theorem
3.4, $\dim_{\mathrm{p}}(\mathcal{C}_\beta) = \beta$ and $\mathcal{C}_\beta$ is on-line learnable with $\gamma 2^n$ mistakes, so it is
on-line learnable with $\alpha 2^n$ mistakes.

The case $\beta = 0$ holds trivially with the same definition of $\mathcal{C}_\beta$. □

As a last remark, we can generalize all results in this section by using a
weaker on-line learning model that is only required to learn when examples
are given in lexicographical order.

4 Dimension and PAC Learning 

This section is focused on the PAC learning model. This model was related
to resource-bounded measure [21] by Watanabe et al. [18]. In this section
we show that it is also related to polynomial-space dimension, partially
generalizing [18].

Our main result here proves that the polynomial-space dimension of concept
classes that are learnable by a PAC algorithm is zero. This result can be
used to demonstrate that a large class $\mathcal{C}$ (in the dimension sense) is not
efficiently PAC learnable using any hypothesis class $\mathcal{H}$ (that is, in the notation
of [16], $\mathcal{C}$ is inherently unpredictable).

Finally we show a stronger result for plogon and p$_2$ as dimension resource
bounds, but in this case, some extra hypotheses are required.

Since the PAC learner resource bounds (time and space) depend on
the representation size, only classes with subexponential representations
($\mathrm{size}_n(c) \in o(2^n)$ for all $c$) are of real interest.

Theorem 4.1 Let $\mathcal{C}$ be a PAC learnable concept class with subexponential
representations. Then

$$\dim_{\mathrm{pspace}}(\mathcal{C}) = 0.$$

Moreover, if there exists a PAC algorithm that runs in space $O(2^n)$ with a
number of examples $\ell(n)$ verifying $\sum_{i=0}^{n} \ell(i) \in o(2^n)$, then

$$\dim_{\mathrm{pspace}}(\mathcal{C}) = 0.$$

Proof. The first statement is a particular case of the second one, so it is
enough to prove the latter.

Let A be the PAC algorithm that witnesses that C is PAC learnable. 
Let $D$ be the uniform distribution. Let $s > \mathcal{H}(\epsilon)$. Let $c \in \mathcal{C}_n$. Then, the
algorithm $A$ on input $(n, \epsilon, \delta)$ outputs (with probability $1 - \delta$) a hypothesis
$h$ such that $h$ $\epsilon$-approximates $c$. Then, with probability $1 - \delta$,

$$\frac{\#\{ x \in \{0,1\}^n \mid h(x) \neq c(x) \}}{2^n} < \epsilon.$$

Let $\mathcal{Q}_n$ be the class of possible sets of examples that $A(n, \epsilon, \delta)$ can use, i.e.

$$\mathcal{Q}_n = \{ Q \subseteq \{0,1\}^n \mid \#Q \le \ell(n) \}.$$

Now, let $Q \in \mathcal{Q}_n$ and let $w$ be of length $2^n$ (that is, the lexicographical
representation of a concept $c \in \mathcal{C}_n$). Then, we say that $w$ is good for $A$ with respect
to $Q$ if

$$A^{c,Q}(n, \epsilon, \delta) = h \text{ with } \mathrm{error}(c, h) < \epsilon,$$

where the notation $A^{c,Q}$ denotes the output of $A$ when $A$ is given as
examples the elements of $Q$ together with the information of whether or not
they are in $c$.

Intuitively, w is good for A with respect to Q if we can learn approxi- 
mately the concept c represented by w using the examples that Q provides 
to A. 

Let $B_{n,Q}$ be the set of sequences of length $2^n$ that are good for $A$ with
respect to $Q$ and let $d_{n,Q} : \{0,1\}^{\le 2^n} \to [0, \infty)$ be the function defined as
follows,

$$d_{n,Q}(v) = \frac{\#\{ w \text{ good for } A \text{ with respect to } Q \mid v \sqsubseteq w \}}{\#B_{n,Q}}.$$

Notice that, by reusing space, $d_{n,Q}$ is computable in space $O(2^n)$. Next,
the following function is defined by considering all $Q \in \mathcal{Q}_n$,

$$d_n(v) = \frac{\sum_{Q \in \mathcal{Q}_n} d_{n,Q}(v)}{\#\mathcal{Q}_n}.$$

Notice that $d_n$ is also computable in space $O(2^n)$. Moreover,

i) $d_n(\lambda) = 1$.

ii) $d_n$ verifies $d_n(w0) + d_n(w1) = d_n(w)$ for all $|w| < 2^n$.

iii) If $w$ is a sequence of length $2^n$ then

$$d_n(w) = \frac{1}{\#\mathcal{Q}_n} \sum_{Q \in \mathcal{Q}_n(w)} \frac{1}{\#B_{n,Q}}, \qquad (1)$$

where $\mathcal{Q}_n(w)$ is defined by

$$\mathcal{Q}_n(w) = \{ Q \in \mathcal{Q}_n \mid w \text{ is good for } A \text{ with respect to } Q \}.$$

Now, we define the function $d : \{0,1\}^* \to [0, \infty)$ as

$$d(w) = 2^{s|w|} \prod_{i=0}^{n} d_i(w^i),$$

where $w = w^0 \ldots w^n$ with $|w^i| = 2^i$ for all $0 \le i < n$ and $|w^n| \le 2^n$.

It is easy to see that $d$ is an s-gale. Also, since each $d_i$ is computable in
space $O(2^i)$ (with $i \le n$) and $i \le \log |w|$, $d \in$ pspace.

Finally, we will need the following lemma that gives an upper bound
on the number of sequences that are good for $A$ with respect to any $Q$.

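Lemma 4.2 For all $n \in \mathbb{N}$ and $Q \in \mathcal{Q}_n$ we have that

$$\#B_{n,Q} \le 2^{\mathcal{H}(\epsilon) 2^n}\, 2^{\ell(n)},$$

where $\epsilon$ is the error parameter in the PAC algorithm $A$.

Proof. Let us see how many different hypotheses the algorithm $A$ can return
when using the examples provided by a fixed set of examples $Q$. Notice that
each $Q \in \mathcal{Q}_n$ verifies that $\#Q \le \ell(n)$, thus $A$ can generate at most $2^{\ell(n)}$
hypotheses.

Fix one of those hypotheses, say $h$, and let $h \in \{0,1\}^{2^n}$ be its
characteristic sequence. Then, we estimate the number of sequences that are an
$\epsilon$-approximation of $h$ as follows. If

$$\mathrm{Approx}(\epsilon, h) = \{ w \in \{0,1\}^{2^n} \mid \#\{ i \in \{0 \ldots 2^n - 1\} \mid h[i] \neq w[i] \} \le \epsilon 2^n \},$$

then by the Chernoff bound [6]

$$\#\mathrm{Approx}(\epsilon, h) = \sum_{i=0}^{\epsilon 2^n} \binom{2^n}{i} \le 2^{\mathcal{H}(\epsilon) 2^n}.$$

So, for each hypothesis $h$ there are at most $2^{\mathcal{H}(\epsilon) 2^n}$ sequences that
$\epsilon$-approximate $h$ and thus, for each $n \in \mathbb{N}$,

$$\#B_{n,Q} \le 2^{\mathcal{H}(\epsilon) 2^n}\, 2^{\ell(n)}.$$

□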
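
The counting bound used in the proof of Lemma 4.2, $\sum_{i \le \epsilon 2^n} \binom{2^n}{i} \le 2^{\mathcal{H}(\epsilon) 2^n}$ for $\epsilon \le 1/2$, can be checked numerically for small parameters. The following is only a sanity check under that assumption on $\epsilon$, with illustrative function names; it is not a proof.

```python
import math


def binary_entropy(eps):
    """Shannon binary entropy H(eps) in bits, with H(0) = H(1) = 0."""
    if eps in (0.0, 1.0):
        return 0.0
    return eps * math.log2(1 / eps) + (1 - eps) * math.log2(1 / (1 - eps))


def approx_count(eps, n):
    """Number of strings of length 2^n that differ from a fixed h in at most eps * 2^n positions."""
    length = 2**n
    return sum(math.comb(length, i) for i in range(math.floor(eps * length) + 1))


def entropy_bound_holds(eps, n):
    """Check #Approx(eps, h) <= 2^(H(eps) * 2^n) for one choice of eps <= 1/2 and n."""
    return math.log2(approx_count(eps, n)) <= binary_entropy(eps) * 2**n
```

The count does not depend on the particular $h$, since flipping the bits of $h$ maps one Hamming ball onto another of the same size.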

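Now, let us see that $d$ succeeds on $\mathcal{C}$. Let $L \in \mathcal{C}$ and let $c \in \mathcal{C}_n$ be the
concept represented by $L^{=n}$. On the one hand, since $A(n, \epsilon, \delta)$ returns with
probability $1 - \delta$ an $\epsilon$-approximation of $c$, we have that

$$\frac{\#\mathcal{Q}_n(L^{=n})}{\#\mathcal{Q}_n} \ge 1 - \delta. \qquad (2)$$

On the other hand, by (1)

$$d_n(L^{=n}) = \frac{1}{\#\mathcal{Q}_n} \sum_{Q \in \mathcal{Q}_n(L^{=n})} \frac{1}{\#B_{n,Q}}.$$

Using Lemma 4.2 in the last equation we have that

$$d_n(L^{=n}) \ge \frac{\#\mathcal{Q}_n(L^{=n})}{\#\mathcal{Q}_n} \cdot \frac{1}{2^{\mathcal{H}(\epsilon) 2^n + \ell(n)}} \ge \frac{1 - \delta}{2^{\mathcal{H}(\epsilon) 2^n + \ell(n)}},$$

where the last inequality is obtained using (2).
Therefore, for all $n \in \mathbb{N}$,

$$\begin{aligned}
d(L[0 \ldots 2^{n+1} - 2]) &\ge 2^{s(2^{n+1} - 1)} \prod_{i=0}^{n} \frac{1 - \delta}{2^{\mathcal{H}(\epsilon) 2^i + \ell(i)}} \\
&= \frac{2^{s(2^{n+1} - 1)} (1 - \delta)^{n+1}}{2^{\sum_{i=0}^{n} (\mathcal{H}(\epsilon) 2^i + \ell(i))}} \\
&= \frac{2^{(s - \mathcal{H}(\epsilon))(2^{n+1} - 1)} (1 - \delta)^{n+1}}{2^{\sum_{i=0}^{n} \ell(i)}},
\end{aligned}$$

that tends to infinity since $s > \mathcal{H}(\epsilon)$. Finally, $\epsilon > 0$ is arbitrary and $\mathcal{H}(\epsilon)$
tends to 0 when $\epsilon \to 0$ so, for all $s > 0$, we can define an s-gale in pspace
that succeeds on $\mathcal{C}$. □

Notice that the above theorem is true for any hypothesis class that we
may consider. So, all the positive results in Example 2.2 (both PAC
learnable and properly PAC learnable) can be used to obtain results on
pspace-dimension.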



Corollary 4.3 The following classes have polynomial-space dimension zero:

1. The class of conjunctive concepts. 

2. Linear threshold concepts (perceptrons). In fact, it is proven in [7] that
this class also has polynomial-time dimension zero.

3. The class of concepts in the form linear thresholds of linear thresholds 
(i.e. multilayer perceptrons with hidden units). 

4. The classes of k-DNF, k-CNF and k-decision lists (for each fixed k).

Theorem 4.1 can also be used in a negative way, obtaining the following
strong nonlearnability result for any hypothesis class of a concept class.

Corollary 4.4 Let C be a concept class such that $\dim_{\mathrm{pspace}}(C) \ne 0$. Then,
C is inherently unpredictable, i.e. there is no hypothesis class for which C
is PAC learnable.

We can generalize Theorem 4.1 to PAC algorithms that use a larger
number of examples.

Theorem 4.5 Let C be a concept class that can be learned by a PAC algorithm
in space $O(2^n)$ with at most $\alpha 2^n$ examples ($\alpha < 1$ pspace-computable).
Then

\[ \dim_{\mathrm{pspace}}(C) \le \alpha. \]

Proof. The proof is analogous to the last theorem, just using
$\sum_{i=0}^{n} q(i) = \alpha(2^{n+1} - 1)$. Thus

\[ d(L[0 \ldots 2^{n+1}-2]) \ge 2^{(s - \mathcal{H}(\epsilon) - \alpha)(2^{n+1}-1)} (1-\delta)^{n+1}, \]

that tends to infinity when $s > \mathcal{H}(\epsilon) + \alpha$. Finally, $\epsilon > 0$ is arbitrary and
$\mathcal{H}(\epsilon)$ tends to 0 when $\epsilon \to 0$. So, for all $s > \alpha$ we can define an s-gale in
pspace that succeeds on C. □
Finally we look at more efficient dimension versions and obtain the fol- 
lowing. 

Theorem 4.6 Let C be a concept class that can be learned by a PAC algorithm
with working space and number of examples bounded by p(n), for p a
fixed polynomial. Then

\[ \dim_{\mathrm{plogon}}(C) = 0. \]

Theorem 4.7 Let C be a concept class that can be learned by a PAC algorithm
within time $2^n$ and with number of examples bounded by p(n), for p a
fixed polynomial. Then

\[ \dim_{\mathrm{p}_2}(C) = 0. \]

Notice that the polynomial bounds in the theorems above do not imply 
a trivial representation size for either the concept class or the hypothesis, 
since the PAC algorithm output space is not likewise bounded. 






5 Dimension and membership-query learning 



This section is focused on the membership-query learning model. The main
result proves that the polynomial-space dimension of concept classes that are
learnable by a membership-query algorithm that runs in space $O(2^n)$ and
makes at most $o(2^n)$ queries is zero. This implies that large classes in the
dimension sense require long representations in order to be membership
learnable. Finally we show a stronger result under more restricted learnability
conditions.

Theorem 5.1 Let C be a class of concepts learnable with a membership-query
algorithm that runs in space $O(2^n)$ and makes at most $o(2^n)$ queries.
Then

\[ \dim_{\mathrm{pspace}}(C) = 0. \]

Proof. 

Let A be the query learning algorithm that witnesses that C is learnable
with $o(2^n)$ membership queries. Let q(n) be the maximum number of queries
of A on input n.

Let w be a string of length $2^n$. We say that w is good for A if the
algorithm A on input n outputs a hypothesis h equivalent to w and the
total number of queries made by A is bounded by q(n). Let $B_n$ be the set
of all sequences of length $2^n$ that are good for A.

Notice that, if only membership queries are allowed, then the number of
different outputs of A(n) is bounded by $2^{q(n)}$, so $\#B_n \le 2^{q(n)}$.

Let $d_n : \{0,1\}^{\le 2^n} \to [0, \infty)$ be the function defined as follows,

\[ d_n(v) = \frac{\#\{ w \text{ good for } A \mid v \sqsubseteq w \}}{\#B_n}. \]

Notice that, by reusing space, $d_n$ can be computed in space $O(2^n)$. Also,
if w has length $2^n$ and is good for A, $d_n(w) = 1/\#B_n$.

Now, the s-gale $d : \{0,1\}^* \to [0, \infty)$ is defined by

\[ d(w) = 2^{s|w|} \prod_{i=0}^{n} d_i(w_i), \]

where $w = w_0 \ldots w_n$ with $|w_i| = 2^i$ for all $0 \le i < n$ and $|w_n| \le 2^n$.
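The construction of the block functions and of d can be made concrete on a toy instance. The Python sketch below is illustrative only: the sets in `TOY_GOOD_SETS` are hypothetical stand-ins for the sets of good strings of lengths 1, 2 and 4, and the helper names are ours, not the paper's. It implements each block function as the fraction of good strings that extend a given prefix, and d as the product over blocks scaled by $2^{s|w|}$.

```python
def make_block_measure(good):
    """Return d_n for a nonempty set `good` of equal-length binary strings:
    d_n(v) = #{w in good : v is a prefix of w} / #good."""
    total = len(good)

    def d_n(v):
        return sum(1 for w in good if w.startswith(v)) / total

    return d_n


def split_blocks(w):
    """Split w into blocks w_0, w_1, ... with |w_i| = 2^i
    (the last block may be shorter)."""
    blocks, start, i = [], 0, 0
    while start < len(w):
        blocks.append(w[start:start + 2 ** i])
        start += 2 ** i
        i += 1
    return blocks


def gale(w, s, block_measures):
    """d(w) = 2^(s|w|) * product over blocks of d_i(w_i)."""
    value = 2.0 ** (s * len(w))
    for i, block in enumerate(split_blocks(w)):
        value *= block_measures[i](block)
    return value


# Hypothetical toy sets of "good" strings of lengths 1, 2 and 4.
TOY_GOOD_SETS = [{"0", "1"}, {"00", "11"}, {"0000", "0101", "1111"}]
TOY_MEASURES = [make_block_measure(b) for b in TOY_GOOD_SETS]
```

On these toy sets one can check the s-gale condition $d(w0) + d(w1) = 2^s d(w)$ for every string w of length less than 7, and that a concatenation of good strings such as `"1110101"` receives capital $2^{7s}/12$.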

It is easy to see that d is an s-gale. Also, since each $d_i$ is computable in
space $O(2^i)$ (with $i \le n$) and $i \le \log |w|$, $d \in$ pspace.

Finally, let us prove that d succeeds on C. Let $L \in C$ and $n \in \mathbb{N}$, then
$L^{=n}$ is good for A and

\[ d_n(L^{=n}) = \frac{1}{\#B_n} \ge \frac{1}{2^{q(n)}}. \]

Thus, for all $n \in \mathbb{N}$,

\[ d(L[0 \ldots 2^{n+1}-2]) \ge 2^{s(2^{n+1}-1)} \prod_{i=0}^{n} \frac{1}{2^{q(i)}}, \]

that tends to infinity when $s > 0$. □
If the number of queries in Theorem 5.1 is allowed to be up to $\alpha 2^n$, then
$\alpha$ is an upper bound for the polynomial-space dimension of C.

Theorem 5.2 Let C be a class of concepts learnable with a membership-query
algorithm that runs in space $O(2^n)$ and makes at most $\alpha 2^n$ queries
($\alpha < 1$, $\alpha$ pspace-computable). Then

\[ \dim_{\mathrm{pspace}}(C) \le \alpha. \]

Proof. The proof is analogous to the above theorem, just consider
$q(n) = \alpha 2^n$. In this case,

\[ \#B_n \le 2^{\alpha 2^n}, \]

and then,

\[ d(L[0 \ldots 2^{n+1}-2]) \ge \prod_{i=0}^{n} \frac{2^{s 2^i}}{2^{\alpha 2^i}} = 2^{(s-\alpha)(2^{n+1}-1)}, \]

that tends to infinity when $s > \alpha$. Therefore, $\dim_{\mathrm{pspace}}(C) \le \alpha$. □

The following theorem shows that Theorem 5.2 is optimal.

Theorem 5.3 Let $\alpha \in \mathbb{Q} \cap (0, 1)$. There exists a concept class $C_\alpha$ that is
query-learnable with $\alpha 2^n$ membership queries such that

\[ \dim_{\mathrm{pspace}}(C_\alpha) = \alpha. \]

Proof. We will use a construction from Theorem 4.3 in [23]. Let $L \in \mathbf{C}$
and let $L = L_1 L_2 L_3 \ldots$ be a partition of L with $|L_i| = \alpha 2^i$. We define the
sequence $\tilde{L} \in \mathbf{C}$ as the concatenation of $\tilde{L}_i = L_i 0^{(1-\alpha) 2^i}$ with $|\tilde{L}_i| = 2^i$.
Let $C_\alpha = \{ \tilde{L} \mid L \in \mathbf{C} \}$, then it is clear that this class is learnable with a
query-learning algorithm that makes $\alpha 2^n$ queries. Notice that it is only necessary
to query about the bits that are provided from the original sequences $L_i$,
because the other bits are all zero, and these are exactly $\alpha 2^n$ for each n.
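The padding construction and the corresponding query strategy can be sketched in a few lines of Python. This is an illustration under assumptions that the paper does not state: the number of free bits in block i is taken to be $\lfloor \alpha 2^i \rfloor$, the membership oracle is modelled as a function from a position inside the block to the character `"0"` or `"1"`, and the names `free_bits`, `pad_block` and `learn_block` are ours.

```python
from math import floor


def free_bits(i, alpha):
    """Number of unconstrained bits at the start of block i.
    Rounding alpha * 2^i down is an assumption of this sketch."""
    return floor(alpha * 2 ** i)


def pad_block(free_part, i, alpha):
    """Return the block of length 2^i made of the free bits followed by zeros."""
    k = free_bits(i, alpha)
    if len(free_part) != k:
        raise ValueError("free_part must have exactly free_bits(i, alpha) bits")
    return free_part + "0" * (2 ** i - k)


def learn_block(membership_oracle, i, alpha):
    """Recover block i by querying only its free positions.
    Returns the hypothesis and the number of membership queries used."""
    k = free_bits(i, alpha)
    answers = "".join(membership_oracle(pos) for pos in range(k))
    return pad_block(answers, i, alpha), k
```

For example, with $\alpha = 1/2$ and i = 3 the target `pad_block("1011", 3, 0.5)` is `"10110000"`, and `learn_block` recovers it with 4 queries, one per free position.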

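Let us see that $\dim_{\mathrm{pspace}}(C_\alpha) = \alpha$. First, we will see that $\dim_{\mathrm{pspace}}(C_\alpha) \le \alpha$.
Let $s \in [0, 1]$ and define $d : \{0,1\}^* \to [0, \infty)$ as follows:

i) $d(\lambda) = 1$.

ii) Let $w = w_0 \ldots w_m$ with $|w_i| = 2^i$ for all $i < m$ and $|w_m| < 2^m$, then

\[ d(wb) = \begin{cases} 2^{s-1} d(w) & \text{if } |w_m| < \alpha 2^m, \\ 2^{s} d(w) & \text{if } |w_m| \ge \alpha 2^m \text{ and } b = 0, \\ 0 & \text{if } |w_m| \ge \alpha 2^m \text{ and } b = 1. \end{cases} \]

It is clear that this function is a pspace-computable s-gale. Let $L \in C_\alpha$, then

\[ d(L[0 \ldots 2^{n}-2]) = \prod_{i=0}^{n-1} 2^{(s-1)\alpha 2^i} \, 2^{s(1-\alpha) 2^i} = 2^{(s-\alpha)(2^{n}-1)}, \]

that tends to infinity when $s > \alpha$, so $\dim_{\mathrm{pspace}}(C_\alpha) \le \alpha$.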
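The gale defined in i) and ii) can be written out directly. The Python sketch below is a hedged illustration: it assumes that block i has $\lfloor \alpha 2^i \rfloor$ free bits (a rounding convention the paper does not discuss), represents strings as Python `str` values over `"0"` and `"1"`, and uses helper names (`free_bits`, `block_position`, `padding_gale`) that are ours.

```python
from math import floor


def free_bits(i, alpha):
    """Number of unconstrained bits at the start of block i (rounded down)."""
    return floor(alpha * 2 ** i)


def block_position(length):
    """For a prefix of the given length, return (m, offset): the index of the
    block currently being filled and how many of its bits are already written."""
    m = 0
    while length >= 2 ** m:
        length -= 2 ** m
        m += 1
    return m, length


def padding_gale(w, s, alpha):
    """Capital of the gale after reading w: it splits its bet evenly on free
    bits and bets everything on 0 on the padded positions."""
    value = 1.0
    for n, bit in enumerate(w):
        m, offset = block_position(n)
        if offset < free_bits(m, alpha):
            value *= 2 ** (s - 1)      # free bit: no information, even bet
        elif bit == "0":
            value *= 2 ** s            # padded bit correctly predicted
        else:
            return 0.0                 # a 1 in a padded position loses everything
    return value
```

On a padded sequence whose prefix of length T contains F free bits, the capital is $2^{sT - F}$; with $\alpha = 1/2$ and the four blocks of lengths 1, 2, 4 and 8 this gives T = 15 and F = 7.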

Let us see that $\dim_{\mathrm{pspace}}(C_\alpha) \ge \alpha$ using a gale-diagonalization technique.
Let $s < \alpha$ and let d be a pspace-computable s-gale. We will recursively
build a sequence $L \in C_\alpha$ such that d does not succeed on L. Suppose that
$L[0 \ldots n-1]$ has been built and let $L[0 \ldots n-1] = L_0 \ldots L_m$ where $|L_i| = 2^i$
and $|L_m| < 2^m$. We define

\[ L[n] = \begin{cases} b & \text{if } |L_m| < \alpha 2^m \text{ and } d(L[0 \ldots n-1]b) \le d(L[0 \ldots n-1]\bar{b}), \\ 0 & \text{if } |L_m| \ge \alpha 2^m. \end{cases} \]

It is clear that $L \in C_\alpha$; let us see now that d does not succeed on L. In the
best case, d multiplies its current capital by $2^s$ on each of the last $(1-\alpha) 2^i$ bits of any
$L_i$, and by at most $2^{s-1}$ on each of the other bits (with equality
when $d(L[0 \ldots n-1]b) = d(L[0 \ldots n-1]\bar{b})$). So,

\[ d(L[0 \ldots 2^{m+1}-2]) \le \prod_{i=0}^{m} 2^{s(1-\alpha) 2^i} \, 2^{(s-1)\alpha 2^i} = \prod_{i=0}^{m} 2^{(s-\alpha) 2^i} = 2^{(s-\alpha)(2^{m+1}-1)}, \]

that does not tend to infinity when $s < \alpha$. □
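The diagonalization step can also be illustrated in code. The Python sketch below is not the paper's construction verbatim: it assumes $\lfloor \alpha 2^i \rfloor$ free bits per block, treats the gale as an arbitrary Python callable on strings, and uses a hypothetical opponent `biased_gale` purely as an example; all names are ours. On free positions it picks the bit on which the gale holds less capital, and on padded positions it writes 0.

```python
from math import floor


def free_bits(i, alpha):
    """Number of unconstrained bits at the start of block i (rounded down)."""
    return floor(alpha * 2 ** i)


def block_position(length):
    """Return (m, offset) for a prefix of the given length: the block being
    filled and the number of bits already written in it."""
    m = 0
    while length >= 2 ** m:
        length -= 2 ** m
        m += 1
    return m, length


def diagonalize(gale, alpha, length):
    """Greedily build a prefix of a padded sequence on which `gale`
    gains as little capital as possible."""
    prefix = ""
    for n in range(length):
        m, offset = block_position(n)
        if offset < free_bits(m, alpha):
            # Free position: choose the bit with the smaller capital.
            bit = "0" if gale(prefix + "0") <= gale(prefix + "1") else "1"
        else:
            bit = "0"  # padded position: forced to be zero
        prefix += bit
    return prefix


def biased_gale(s, p_zero=0.75):
    """Example s-gale (a hypothetical opponent) that always bets a fraction
    p_zero of its capital on the next bit being 0."""

    def d(w):
        value = 1.0
        for bit in w:
            value *= 2 ** s * (p_zero if bit == "0" else 1 - p_zero)
        return value

    return d
```

For a prefix of length T with F free bits, the capital of any s-gale on the diagonal sequence is at most $2^{sT - F}$, because each free bit contributes a factor of at most $2^{s-1}$ and each padded bit a factor of at most $2^s$.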
As a corollary of Theorem 5.1 we know that large classes (in the dimension
sense) require large representations in order to be membership learnable.

Corollary 5.4 Let C be a concept class such that

\[ \dim_{\mathrm{pspace}}(C) \ne 0. \]

Then C does not have a representation of size $o(2^n)$ for which it is efficiently
learnable by membership queries.

Proof. By definition, C is efficiently learnable if the running time and the
total number of queries made by A are bounded by a polynomial in n and
in $\mathrm{size}_n(c)$. Thus, if $\mathrm{size}_n \in o(2^n)$, the running time (and then the working
space) is $o(2^n)$ and $\dim_{\mathrm{pspace}}(C) = 0$, which is a contradiction. □

Finally, the following theorem proves that Theorem 5.1 is also true for
plogon-dimension when polynomial bounds are required.

Theorem 5.5 Let C be a class of concepts learnable with a membership- 
query algorithm that runs in space polynomial in n and makes at most a 
polynomial number of queries. Then, 

\[ \dim_{\mathrm{plogon}}(C) = 0. \]

Neither the output space nor the running time of the learning algorithm is
restricted in the theorem above. Therefore nontrivial representation sizes can
be learned under the conditions of Theorem 5.5.

The above theorem is also true for $\mathrm{p}_2$-dimension.

6 Future work 

Results connecting PAC-learning with p-dimension would shed new light on 
the learnability of languages in exponential time (EXP). 